feat(yp): faster typescript serialization#23713

Open
AztecBot wants to merge 10 commits into next from cb/spike-tx-write-interface

Collaborator

@AztecBot AztecBot commented May 29, 2026

Result: 3-4x faster typescript serialization.

Add a second mode to toBuffer: if a buffer 'sink' is passed, it writes to that sink instead of returning a new buffer. Thus, for a top-level toBuffer() call, instead of

a.toBuffer() = a.b.toBuffer() + a.c.toBuffer()

we have, approximately

a.toBuffer() = {
  sink = new Sink()
  a.b.toBuffer(sink)
  a.c.toBuffer(sink)
  sink.toBuffer();
}

This allows us to avoid intermediate buffer allocations. Additionally, we used some trial and error to speed things up further based on how V8 optimizes the code; the resulting measurements are hopefully representative.
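The pattern above can be sketched as runnable TypeScript. This is a minimal illustration, not the PR's code: `Sink`, `Leaf`, and `Pair` are hypothetical names, and `Uint8Array` stands in for Node's `Buffer` to keep the sketch dependency-free.

```typescript
// Hypothetical minimal sink: one growable byte array, copied out once at the root.
class Sink {
  private bytes: Uint8Array;
  private length = 0;

  constructor(initialCapacity = 1024) {
    this.bytes = new Uint8Array(Math.max(1, initialCapacity));
  }

  // Grow by doubling until `extra` more bytes fit.
  private ensure(extra: number): void {
    const needed = this.length + extra;
    if (needed <= this.bytes.length) return;
    let capacity = this.bytes.length;
    while (capacity < needed) capacity *= 2;
    const grown = new Uint8Array(capacity);
    grown.set(this.bytes.subarray(0, this.length));
    this.bytes = grown;
  }

  writeBytes(src: Uint8Array): void {
    this.ensure(src.length);
    this.bytes.set(src, this.length);
    this.length += src.length;
  }

  // Copy out exactly the bytes written so far.
  toBuffer(): Uint8Array {
    return this.bytes.slice(0, this.length);
  }
}

// Hypothetical leaf node holding raw bytes.
class Leaf {
  private readonly data: Uint8Array;
  constructor(data: Uint8Array) {
    this.data = data;
  }
  toBuffer(): Uint8Array;
  toBuffer(sink: Sink): void;
  toBuffer(sink?: Sink): Uint8Array | void {
    if (sink) {
      sink.writeBytes(this.data);
      return;
    }
    return this.data.slice();
  }
}

// Hypothetical parent: with a sink it streams its children into it;
// without one it creates the root sink and slices once.
class Pair {
  readonly b: Leaf;
  readonly c: Leaf;
  constructor(b: Leaf, c: Leaf) {
    this.b = b;
    this.c = c;
  }
  toBuffer(): Uint8Array;
  toBuffer(sink: Sink): void;
  toBuffer(sink?: Sink): Uint8Array | void {
    if (sink) {
      this.b.toBuffer(sink);
      this.c.toBuffer(sink);
      return;
    }
    const root = new Sink();
    this.toBuffer(root);
    return root.toBuffer();
  }
}
```

Only the outermost call allocates a result; nested calls write into the shared sink and return nothing.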

@AztecBot AztecBot added the claudebox Owned by claudebox. it can push to this PR. label May 29, 2026
@AztecBot AztecBot changed the title spike: streaming .write serialization for the Tx.toBuffer recursive path spike: streaming toBuffer(sink?) serialization for the Tx path (~11x) May 29, 2026
@ludamad ludamad added ci-draft Run CI on draft PRs. ci-full Run all master checks. labels May 29, 2026
@ludamad ludamad marked this pull request as ready for review May 29, 2026 22:34
@AztecBot AztecBot changed the title spike: streaming toBuffer(sink?) serialization for the Tx path (~11x) refactor(stdlib): streaming toBuffer(sink?) for the Tx path (~11x) May 29, 2026
@AztecBot AztecBot force-pushed the cb/spike-tx-write-interface branch 3 times, most recently from cfdb8a1 to 080c41d Compare May 29, 2026 23:19
…path

Replace the recursive Tx.toBuffer() chain (Buffer alloc at every node,
Buffer.concat at every level) with a single growable ArrayBuffer the whole
object graph streams into and that is sliced once at the root.

The migration contract is the optional-sink overload:
  toBuffer(): Buffer;
  toBuffer(sink: BufferSink): void;
Pass a sink and it writes + returns undefined; omit it and it returns its
own buffer. Unmigrated children fall back via return value, so it lands
incrementally and existing toBuffer() callers keep working.
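A minimal sketch of how a parent might consume that contract while some children are still unmigrated. The names `ByteSink`, `Child`, `writeChild`, `LegacyNode`, `MigratedNode`, and `CollectingSink` are hypothetical, and `Uint8Array` stands in for `Buffer`; the PR's actual helper may differ.

```typescript
// Hypothetical sink interface for this sketch.
interface ByteSink {
  writeBytes(src: Uint8Array): void;
}

// A child either writes into the sink (returns nothing) or ignores it and returns bytes.
interface Child {
  toBuffer(sink?: ByteSink): Uint8Array | void;
}

// If the child returned bytes, it was unmigrated: append them on its behalf.
function writeChild(sink: ByteSink, child: Child): void {
  const returned = child.toBuffer(sink);
  if (returned instanceof Uint8Array) {
    sink.writeBytes(returned);
  }
}

// Unmigrated: knows nothing about sinks and always returns its own bytes.
class LegacyNode {
  toBuffer(): Uint8Array {
    return Uint8Array.from([0xaa]);
  }
}

// Migrated: writes into the sink when given one.
class MigratedNode {
  toBuffer(sink?: ByteSink): Uint8Array | void {
    const bytes = Uint8Array.from([0xbb]);
    if (sink) {
      sink.writeBytes(bytes);
      return;
    }
    return bytes;
  }
}

// Trivial sink that records bytes in order, for demonstration only.
class CollectingSink implements ByteSink {
  readonly bytes: number[] = [];
  writeBytes(src: Uint8Array): void {
    src.forEach(b => this.bytes.push(b));
  }
}
```

Because the fallback keys off the return value, a class can be migrated on its own without touching its parents or its callers.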

Converts the Tx spine end-to-end: Tx/TxArray, TxHash/TxHashArray,
PrivateKernelTailCircuitPublicInputs (+partials), PrivateToRollupAccumulatedData,
ChonkProof/ChonkProofWithPublicInputs, HashedValues, Vector, BaseField.

BufferSink.writeBigInt uses 4x DataView.setBigUint64 limbs for 32-byte
fields (no hex round-trip, no per-field alloc). On a modeled rollup Tx
(~2660 fields) byte-identical to today and ~11x faster end-to-end; the
naive per-byte shift loop is actually slower than legacy, so picking the
right field encoder is the win.
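As an illustration of the limb approach, the following sketch writes a 32-byte big-endian field with four `DataView.setBigUint64` calls. The function name `writeField32` and the big-endian layout are assumptions for this example; the PR's `BufferSink.writeBigInt` may be structured differently.

```typescript
const MASK64 = (1n << 64n) - 1n;

// Hypothetical helper: encode `value` as 32 big-endian bytes at `offset`.
// Four 64-bit limbs, most significant first; no hex string and no temporary buffer.
function writeField32(view: DataView, offset: number, value: bigint): void {
  if (value < 0n || value >> 256n !== 0n) {
    throw new RangeError("value does not fit in 32 bytes");
  }
  // setBigUint64 is big-endian by default (littleEndian argument omitted).
  view.setBigUint64(offset, value >> 192n);
  view.setBigUint64(offset + 8, (value >> 128n) & MASK64);
  view.setBigUint64(offset + 16, (value >> 64n) & MASK64);
  view.setBigUint64(offset + 24, value & MASK64);
}
```

A per-byte loop would perform 32 BigInt shifts and masks per field, whereas this performs a handful of BigInt operations and four typed writes.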

Adds toBuffer cases (private + public) to stdlib/src/tx/tx_bench.test.ts
recording per-op microseconds + payload bytes; wired into CI via the
existing bench_cmds entry, dashboard series Tx/{private,public}/toBuffer/*.

fromBuffer/zod path is unchanged and out of scope.
@AztecBot AztecBot force-pushed the cb/spike-tx-write-interface branch from 080c41d to f7e7ba2 Compare May 30, 2026 12:26
AztecBot added 4 commits May 31, 2026 02:58
…r ~9x Tx.toBuffer

Real bench (stdlib/src/tx/tx_bench.test.ts) on this PR's spine-only baseline vs after:
- Tx/private/toBuffer:           1.96 ms -> 0.22 ms (~8.9x)
- Tx/public/toBuffer:            3.11 ms -> 0.34 ms (~9.1x)
- Tx/private/toBufferReusedSink: 1.86 ms -> 0.16 ms (~12x)
- Tx/public/toBufferReusedSink:  3.04 ms -> 0.29 ms (~10.5x)

cpu-prof on the prior code showed serializeToSink dominating ~50% of total time:
the rest-args + Array.isArray + Buffer.isBuffer + 5x typeof dispatch ran per element
of every nested array, and serializeToSink(sink, ...obj) allocated a fresh
rest-args array for each recursion (1632-element spread per ChonkProof, every call).

Changes:

- foundation/serialize/buffer_sink: split dispatch into per-element serializeOneToSink
  and an inner serializeArrayToSinkInner that recurses with the array reference, no
  spread. Hot-path objects exposing toBuffer are checked first so Fr/Fq/migrated leaves skip the
  primitive-typeof chain. serializeArrayToSink uses the same inner.

- foundation/curves/bn254/field: BaseField caches its 32-byte serialized form. The
  cache is populated eagerly in the constructor when built from a 32-byte Buffer (the
  deserialization path) and lazily on first toBuffer otherwise. toBuffer returns a
  defensive Buffer.from copy or writes the cached bytes straight into a sink, with no
  bigint->bytes round-trip on the hot path. The Buffer ctor copies via new Uint8Array
  to defend against caller-side mutation; the copy-ctor aliases the cache (it is never
  mutated post-assignment).

- stdlib/tx/tx: pre-size the BufferSink with the last serialized length so the
  no-sink fresh-allocation path skips the 1k->64k doubling-growth cost. Hint lives
  in a module-level WeakMap rather than an instance field so deep-equality assertions
  on Tx (which compare enumerable own properties) are unaffected.
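The dispatch change in the first bullet can be sketched roughly as follows. All names here (`ByteSink`, `Writable`, `Sinkable`, `serializeOne`, `serializeArray`, `CollectingSink`) are hypothetical, and numbers are written as a single byte purely to keep the example short; the real serializeToSink handles more types.

```typescript
interface ByteSink {
  writeByte(b: number): void;
  writeBytes(src: Uint8Array): void;
}

interface Writable {
  toBuffer(sink: ByteSink): void;
}

type Sinkable = number | Uint8Array | Writable | Sinkable[];

// Per-element dispatch. Objects exposing toBuffer are tested first, so the most
// common leaves never reach the array/bytes/primitive checks.
function serializeOne(sink: ByteSink, item: Sinkable): void {
  if (typeof (item as Writable).toBuffer === "function") {
    (item as Writable).toBuffer(sink);
  } else if (Array.isArray(item)) {
    serializeArray(sink, item);
  } else if (item instanceof Uint8Array) {
    sink.writeBytes(item);
  } else {
    sink.writeByte(item as number); // simplification: one byte per number
  }
}

// Recurses with the array reference itself. A rest-args signature such as
// serialize(sink, ...items) would instead allocate a new array on every call.
function serializeArray(sink: ByteSink, items: Sinkable[]): void {
  for (const item of items) {
    serializeOne(sink, item);
  }
}

// Trivial sink that records bytes in order, for demonstration only.
class CollectingSink implements ByteSink {
  readonly bytes: number[] = [];
  writeByte(b: number): void {
    this.bytes.push(b & 0xff);
  }
  writeBytes(src: Uint8Array): void {
    src.forEach(b => this.bytes.push(b));
  }
}
```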
…ench

The previous commit's BaseField byte cache helped the existing steady-state bench
(2051 calls per Tx) but adds a 32-byte Uint8Array alloc plus a Buffer.from copy on
every cold-path Fr.toBuffer call. The synthetic bench was a misleading measurement
since prod typically serializes each Tx once.

Measured impact, dispatch fix + sink presize only (this commit) vs. with the cache:

  variant            no cache   with cache
  private steady     0.28 ms    0.22 ms   (cache +20%)
  private cold       0.31 ms    1.00 ms   (cache -3.2x, real regression)
  public  steady     0.38 ms    0.34 ms
  public  cold       0.37 ms    ~1 ms

The dispatch fix + WeakMap sink-size hint already give ~7-10x vs. the spine-only
baseline (1.94 ms / 3.22 ms) without any state held on Fr instances, no
deserialize-time copies, no extra memory per long-lived Tx in the mempool.

Also adds two cold-start bench cases (one toBuffer per fresh Tx, no warm cache,
no sink reuse) so the dashboard tracks the realistic per-tx cost alongside the
steady-state numbers, and a future byte-cache attempt can be evaluated honestly.
…hint

Three small adds on top of the dispatch-fix + sink-presize commits, all without
the byte-cache tradeoff:

- foundation/serialize/buffer_sink: split a no-width writeField(value) off
  writeBigInt so V8 can specialize the Fr/Fq 32-byte limb encoder without the
  wider routine's width branch. Add writeFields(arr) which iterates a flat
  field-element array inline with no per-element Sinkable dispatch.

- foundation/curves/bn254/field: BaseField.toBuffer(sink) now calls writeField.

- stdlib/proofs/chonk_proof: both proof classes switch the 1632-element field
  vector from serializeToSink(... this.fields) to sink.writeFields(this.fields),
  skipping per-Fr serializeOneToSink dispatch on the largest leaf array in a Tx.

- stdlib/tx/tx: fall back to a process-wide largest-seen Tx size when the
  per-instance WeakMap sink-size hint is missing, so the no-sink fresh-allocation
  cold path (different Tx every call) also benefits from sink pre-sizing once any
  Tx in the process has been serialized.

Best-of-3 AVG us/op vs. spine-only baseline (1940 / 3220):

  variant           current  spine baseline  speedup
  private steady    220      1940            ~8.8x
  public  steady    325      3220            ~9.9x
  private reused    167      1860            ~11.1x
  public  reused    266      3040            ~11.4x
  private cold      ~275     -               -
  public  cold      ~445     -               -

Cold numbers are inherently noisier (each timed call serializes a different Tx
with different field shapes, so V8 inline caches churn) but stay well below
the steady baseline.
@ludamad ludamad changed the title refactor(stdlib): streaming toBuffer(sink?) for the Tx path (~11x) feat(yp): faster typescript serde May 31, 2026
@ludamad ludamad changed the title feat(yp): faster typescript serde feat(yp): faster typescript serialization May 31, 2026
Contributor

@alexghr alexghr left a comment

🚀

@ludamad ludamad enabled auto-merge May 31, 2026 16:32
@ludamad ludamad added claudebox Owned by claudebox. it can push to this PR. and removed ci-draft Run CI on draft PRs. claudebox Owned by claudebox. it can push to this PR. labels May 31, 2026
AztecBot and others added 3 commits May 31, 2026 16:45
Adds buffer_sink.test.ts covering the new BufferSink module: byte-for-byte
equivalence of every sink writer against the legacy serializeBigInt/free_funcs,
serializeToSink dispatch (mixed/nested/migrated/legacy-node fallback), capacity
growth, reset reuse, overflow/negative guards, and a sink->BufferReader round-trip.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The arbitrary-width branch of writeBigInt used a per-byte BigInt shift loop,
which benchmarked as the slowest option (slower than the legacy hex round-trip)
because each byte allocates a fresh BigInt. Replace it with 64-bit setBigUint64
limbs written from the least-significant tail, plus a <=7-byte leftover head for
widths that aren't a multiple of 8. Faster than the legacy path at every width;
multiples of 8 (the only widths used in practice: 8/16/32) take the pure-limb
path. The width===32 unrolled fast path is retained. Extends the width coverage
in buffer_sink.test.ts.
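A sketch of the tail-first limb scheme for arbitrary widths, under stated assumptions: the name `writeBigIntBE` is hypothetical, the encoding is big-endian, and the leftover head is written byte by byte here, which may not match how the PR handles it.

```typescript
const MASK64 = (1n << 64n) - 1n;

// Hypothetical helper: encode `value` as `width` big-endian bytes at `offset`.
function writeBigIntBE(view: DataView, offset: number, value: bigint, width: number): void {
  // Strict range check: reject anything that would need truncation.
  if (value < 0n || value >> BigInt(width * 8) !== 0n) {
    throw new RangeError(`value does not fit in ${width} bytes`);
  }
  let rest = value;
  let pos = offset + width;
  // Full 64-bit limbs, starting from the least-significant end.
  while (pos - offset >= 8) {
    pos -= 8;
    view.setBigUint64(pos, rest & MASK64);
    rest >>= 64n;
  }
  // Leftover head of at most 7 bytes when width is not a multiple of 8.
  while (pos > offset) {
    pos -= 1;
    view.setUint8(pos, Number(rest & 0xffn));
    rest >>= 8n;
  }
}
```

For widths 8, 16, and 32 the second loop never runs, so only the limb writes execute. The range check also rejects a 57-bit value at width 7 rather than silently dropping its high byte.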
Comment thread yarn-project/stdlib/src/tx/tx.ts Outdated

// Per-instance sink size hint. Held externally (WeakMap) so it does not appear as an enumerable instance
// field, which would otherwise make deep-equality assertions fail when one side has been serialized.
const txSizeHints = new WeakMap<Tx, number>();
Contributor

We should know this number right? Can we hardcode it even if approximate?

Collaborator

embarrassingly, I missed that it snuck this in. Will get a hardcoded number, makes sense

…constant

The per-instance WeakMap + process-wide largest-seen-size heuristic both
existed only to pre-size the BufferSink the no-sink Tx.toBuffer() path
allocates. The bootstrapped bench measures the actual Tx payloads at:

- private-only:               81763 bytes
- public-with-enqueued-calls: 129128 bytes

A single 131072-byte (128 KiB) presize covers both shapes without any
doubling-growth ensure() resize on the cold path, and is the same
allocation the WeakMap fast-path made on the steady-state hot path
anyway. Removing the hidden state matches Adam's review feedback and
brings the bench numbers within noise of the WeakMap version:

  variant            weakmap (prev)  constant (this)
  private steady     220 us          ~244 us
  public  steady     325 us          ~351 us
  private reused     167 us          ~176 us
  public  reused     266 us          ~276 us
  private cold       ~275 us         ~273 us
  public  cold       ~445 us         ~427 us

Real-world Txs that exceed 128 KiB keep working — the sink falls back
to its standard doubling growth, just paying the existing cost.
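The sizing argument can be checked with a small worked example. It assumes the sink starts at 1 KiB and grows by doubling when no presize is given; `TX_SINK_PRESIZE` and `doublingsNeeded` are hypothetical names used only for this illustration.

```typescript
// Hypothetical constant mirroring the 128 KiB presize described above.
const TX_SINK_PRESIZE = 128 * 1024; // 131072 bytes

// Number of doubling resizes needed before `payloadBytes` fits.
function doublingsNeeded(initialCapacity: number, payloadBytes: number): number {
  let capacity = initialCapacity;
  let count = 0;
  while (capacity < payloadBytes) {
    capacity *= 2;
    count++;
  }
  return count;
}

// Measured payload sizes quoted in the commit message.
const PRIVATE_ONLY_BYTES = 81763;
const PUBLIC_WITH_CALLS_BYTES = 129128;

console.log(doublingsNeeded(1024, PUBLIC_WITH_CALLS_BYTES));            // from 1 KiB
console.log(doublingsNeeded(TX_SINK_PRESIZE, PRIVATE_ONLY_BYTES));      // presized
console.log(doublingsNeeded(TX_SINK_PRESIZE, PUBLIC_WITH_CALLS_BYTES)); // presized
```

Both measured shapes fit in the presized sink without a resize, while a payload above 128 KiB would incur the usual doubling.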
…ng value

The (0x0123456789abcdefn, 7) case was 57 bits (high byte 0x01) but width=7
only holds 56 bits. Legacy serializeBigInt silently truncates the high byte;
the new writeBigInt is strict and throws to match its 32-byte path. Drop the
overflowing high byte so the value fits, keeping the test's stated intent
("matches serializeBigInt byte-for-byte") aligned with both impls. The
out-of-range strictness is already covered by the dedicated
"rejects out-of-range bigints" block.
Labels

ci-full Run all master checks. ci-squash-and-merge claudebox Owned by claudebox. it can push to this PR.
